{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Splitting data\n",
    "\n",
    "In the previous [notebook](https://github.com/hundredblocks/ml-powered-applications/blob/master/notebooks/dataset_exploration.ipynb), we explored a dataset. Next, we will separate it into a training and test split. Separating a dataset into splits is crucial to validate the performance of a model. By using only a subset of data to train a model, you can use unseen data to produce an estimate of how your model would perform in practice.\n",
    "\n",
    "In this notebook, I will demonstrate a few ways to do just that using the `writers` Stack Overflow dataset. First, we load and format data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "import spacy\n",
    "import umap\n",
    "import numpy as np \n",
    "from pathlib import Path\n",
    "import sys\n",
    "sys.path.append(\"..\")\n",
    "import warnings\n",
    "warnings.filterwarnings('ignore')\n",
    "from ml_editor.data_processing import format_raw_df, get_random_train_test_split, get_vectorized_inputs_and_label, get_split_by_author\n",
    "\n",
    "data_path = Path('../data/writers.csv')\n",
    "df = pd.read_csv(data_path)\n",
    "df = format_raw_df(df.copy())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Random Split\n",
    "\n",
    "The simplest way to generate a test set is to randomly split data between a training set and a test set. This is what we do below."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "train_df_rand, test_df_rand = get_random_train_test_split(df[df[\"is_question\"]], test_size=0.3, random_state=40)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "5579 questions in training, 2392 in test.\n",
      "5579 different owners in the training set\n",
      "2392 different owners in the testing set\n",
      "596 owners appear in both sets\n"
     ]
    }
   ],
   "source": [
    "print(\"%s questions in training, %s in test.\" % (len(train_df_rand),len(test_df_rand)))\n",
    "train_owners = set(train_df_rand['OwnerUserId'].values)\n",
    "test_owners = set(test_df_rand['OwnerUserId'].values)\n",
    "print(\"%s different owners in the training set\" % len(train_df_rand))\n",
    "print(\"%s different owners in the testing set\" % len(test_df_rand))\n",
    "print(\"%s owners appear in both sets\" % len(train_owners.intersection(test_owners)))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This approach comes with one drawback, can you guess what it is before moving on to the next section"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Author Split\n",
    "\n",
    "Some authors may be more skilled at asking questions then others. If an author appears in both the training and the test set, a model could successfully predict the performance of their questions simply by successfully identifying the author. Note that simply removing the `AuthorId` from the set of features does not fully solve this problem, as the formulation of a question may be author specific (especially if some authors include their signature). \n",
    "\n",
    "To make sure we are accurately judging question quality, we would want to make sure that a given author only appears in either the training set or the validation set. This guarantee that a model will not be able to leverage information to identify a given author and use it to predict more easily.\n",
    "\n",
    "To remove this potential source of bias, let's split data by author."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "5676 questions in training, 2295 in test.\n",
      "2723 different owners in the training set\n",
      "1167 different owners in the testing set\n",
      "0 owners appear in both sets\n"
     ]
    }
   ],
   "source": [
    "train_author, test_author = get_split_by_author(df[df[\"is_question\"]], test_size=0.3, random_state=40)\n",
    "\n",
    "print(\"%s questions in training, %s in test.\" % (len(train_author),len(test_author)))\n",
    "train_owners = set(train_author['OwnerUserId'].values)\n",
    "test_owners = set(test_author['OwnerUserId'].values)\n",
    "print(\"%s different owners in the training set\" % len(train_owners))\n",
    "print(\"%s different owners in the testing set\" % len(test_owners))\n",
    "print(\"%s owners appear in both sets\" % len(train_owners.intersection(test_owners)))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Going forward we will use the author split, but there are other methods of splitting data for other types of data. For example, we may want to use a time-based split in order to see whether training on questions written in a given period can produce a model that works well on questions from a more recent period. Please refer to the book for more information on those."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "ml_editor",
   "language": "python",
   "name": "ml_editor"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.5"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
